CIE: Add an SSE2 version of "RGBA float" to "CIE Lab alpha float"

On an Intel i7 Haswell, it now takes 0.13s to convert a 15 megapixel
buffer from "RGBA float" to "CIE Lab alpha float" instead of the
earlier 0.27s.

SSE2 has no packed integer division and no packed 32-bit integer
multiplication (the latter only arrived with SSE4.1), and using bit
shifts to implement integer divisions by powers of 2 seems to
introduce errors. Therefore, it was problematic to use the cube root
approximation from Hacker's Delight, which uses quite a few integer
divisions to make the initial guess. Instead, Halley's method of
approximating the cube root seems more SSEx friendly because the
initial guess requires only one integer division, which we can manage
by jumping through a relatively small number of hoops.
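
For reference, the scalar algorithm can be sketched roughly as below.
This is a minimal sketch based on the widely circulated version of the
code, not a copy of what this commit adds: the function names are made
up for illustration, the input is assumed to be finite and positive,
and the use of two Halley iterations is an assumption taken from the
remark about a single iteration being too coarse.

```c
#include <stdint.h>
#include <string.h>

/* Initial guess: divide the float's bit pattern by 3 and add a magic
 * bias.  This is the single integer division mentioned above.
 * Assumes x is finite and positive. */
static float
cbrt_initial_guess (float x)
{
  uint32_t bits;

  memcpy (&bits, &x, sizeof bits);   /* type-pun without aliasing issues */
  bits = bits / 3 + 709921077u;
  memcpy (&x, &bits, sizeof x);
  return x;
}

/* One step of Halley's method, refining a towards cbrt (r). */
static float
cbrt_halley_step (float a, float r)
{
  float a3 = a * a * a;

  return a * (a3 + r + r) / (a3 + a3 + r);
}

/* Hypothetical helper: initial guess followed by two Halley steps. */
static float
cbrt_sketch (float x)
{
  float a = cbrt_initial_guess (x);

  a = cbrt_halley_step (a, x);
  return cbrt_halley_step (a, x);
}
```

Everything after the initial guess is plain floating point arithmetic,
which is why only that one integer division needs special treatment in
a vectorized version.
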
The scalar version of Halley's method seems to have originated from
http://metamerist.com/cbrt/cbrt.htm but that's not accessible anymore.
At present there's a copy in CubeRoot.cpp in the Skia sources that's
licensed under a BSD-style license. There's some discussion on the
implementation at http://www.voidcn.com/article/p-gpwztojr-wt.html.

Note that Darktable also has an SSE2 version of the same algorithm,
but uses only a single iteration of Halley's method, which is too
coarse.
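
To illustrate what a two-iteration SSE2 variant might look like, here
is a hedged sketch. It is not the code from this commit or from
Darktable. The function names are hypothetical, inputs are assumed to
be finite and positive, and the division by 3 is done by a round trip
through float, which is only one conceivable way of working around the
missing integer division. That round trip drops a few low bits of the
bit pattern, which should not matter for an initial guess.

```c
#include <emmintrin.h>

/* One step of Halley's method on four floats at once, refining a
 * towards cbrt (r). */
static inline __m128
cbrt_halley_step_ps (__m128 a, __m128 r)
{
  __m128 a3  = _mm_mul_ps (_mm_mul_ps (a, a), a);
  __m128 num = _mm_add_ps (a3, _mm_add_ps (r, r));
  __m128 den = _mm_add_ps (_mm_add_ps (a3, a3), r);

  return _mm_div_ps (_mm_mul_ps (a, num), den);
}

/* Hypothetical helper: approximate cube root of four positive floats. */
static inline __m128
cbrt_ps_sketch (__m128 x)
{
  /* Treat the bit patterns as 32-bit integers, convert them to float,
   * divide by 3 and truncate back to integers.  Positive floats have
   * the sign bit clear, so the signed conversion is safe. */
  __m128  bits_f = _mm_cvtepi32_ps (_mm_castps_si128 (x));
  __m128  third  = _mm_div_ps (bits_f, _mm_set1_ps (3.0f));
  __m128i guess  = _mm_add_epi32 (_mm_cvttps_epi32 (third),
                                  _mm_set1_epi32 (709921077));
  __m128  a      = _mm_castsi128_ps (guess);

  /* Two iterations instead of one. */
  a = cbrt_halley_step_ps (a, x);
  return cbrt_halley_step_ps (a, x);
}
```
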
Here's some more discussion on the cube root approximation algorithms:
https://bugzilla.gnome.org/show_bug.cgi?id=791837
https://bugzilla.gnome.org/show_bug.cgi?id=795686